Sequence Alignment, Mutual Information, and Dissimilarity Measures for Constructing Phylogenies

Authors

  • Orion Penner
  • Peter Grassberger
  • Maya Paczuski
Abstract

BACKGROUND Existing sequence alignment algorithms use heuristic scoring schemes based on biological expertise, and these scores cannot serve as objective distance metrics. As a result, one relies either on crude measures, such as the p-distance or the log-det distance, or on explicit and often too simplistic a priori assumptions about sequence evolution. Information theory provides an alternative in the form of mutual information (MI). MI is, in principle, an objective and model-independent similarity measure, but it is not widely used in this context, and no algorithm is known for extracting MI from a given alignment without assuming an evolutionary model. MI can be estimated without alignments by concatenating and zipping sequences, but so far this has produced only estimates with uncontrolled errors, even though the normalized compression distance based on it has shown promising results.

RESULTS We describe a simple approach to obtain robust estimates of MI from global pairwise alignments. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. For animal mitochondrial DNA, our approach uses the alignments made by popular global alignment algorithms to produce MI estimates that are strikingly close to the estimates obtained from the alignment-free methods mentioned above. We point out that, because it is not additive, the normalized compression distance is not an optimal metric for phylogenetics, and we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures such as the Kimura or log-det (or paralinear) distances.

CONCLUSIONS Several versions of MI-based distances outperform conventional distances in distance-based phylogeny. Even a simplified version based on single-letter Shannon entropies, which can easily be incorporated into existing software packages, gave superior results throughout the entire animal kingdom. We see the main virtue of our approach, however, as more general. For example, it can also help to judge the relative merits of different alignment algorithms by estimating the significance of specific alignments. It strongly suggests that information theory concepts can be exploited further in sequence analysis.
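To make the two quantities named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows (a) a plug-in estimate of single-letter Shannon MI from the two rows of a global pairwise alignment and (b) the standard normalized compression distance. The function names `entropy`, `alignment_mi`, and `ncd` are hypothetical, skipping gapped columns is an assumption made for simplicity, and `zlib` merely stands in for whatever compressor one would actually use. The modification of the normalized compression distance proposed in the paper is not reproduced here, since the abstract does not specify it.

```python
import math
import zlib
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def alignment_mi(row_a, row_b, gap="-"):
    """Plug-in single-letter MI (bits per column) of two aligned rows.

    Columns containing a gap are skipped. This is an assumption of the
    sketch, not necessarily how the paper treats gaps.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    h_a = entropy(Counter(a for a, _ in pairs))
    h_b = entropy(Counter(b for _, b in pairs))
    h_ab = entropy(Counter(pairs))
    # I(A;B) = H(A) + H(B) - H(A,B)
    return h_a + h_b - h_ab


def ncd(x, y):
    """Normalized compression distance of two byte strings, with zlib
    as a stand-in compressor (an assumption of this sketch)."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

For instance, `alignment_mi("AACC", "ACAC")` is zero because the two rows are statistically independent at the single-letter level, while two identical rows return the single-letter entropy of the sequence. The plug-in estimate is biased for short alignments, so it should be read as an illustration of the definition rather than as a calibrated distance.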

Related resources

CAFE: aCcelerated Alignment-FrEe sequence analysis

Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software,...


Sequence alignment and mutual information

Background: Alignment of biological sequences such as DNA, RNA or proteins is one of the most widely used tools in computational bioscience. All existing alignment algorithms rely on heuristic scoring schemes based on biological expertise. Therefore, these algorithms do not provide model independent and objective measures for how similar two (or more) sequences actually are. Although informatio...


Multiple Sequence Alignment Errors and Phylogenetic Reconstruction

Table of contents (excerpt): Chapter 1: Introduction; Sequence evolution; Alignment Reconstruction...


Optimal Word Sizes for Dissimilarity Measures and Estimation of the Degree of Dissimilarity Between DNA Sequences

Motivation: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determine the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scal...


RILogo: visualizing RNA-RNA interactions

SUMMARY With the increasing amount of newly discovered non-coding RNAs, the interactions between RNA molecules become an increasingly important aspect for characterizing their functionality. Many computational tools have been developed to predict the formation of duplexes between two RNAs, either based on single sequences or alignments of homologous sequences. Here, we present RILogo, a program...


Volume: 6

Publication date: 2011